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In this paper we provide a provably convergent algorithm for the 
multivariate Gaussian Maximum Likelihood version of the Behrens- 
Fisher Problem. Our work builds upon a formulation of the log- 
likelihood function proposed by Buot and Richards [5]. Instead of 
focusing on the first order optimality conditions, the algorithm aims 
directly for the maximization of the log-likelihood function itself to 
achieve a global solution. Convergence proof and complexity esti- 
mates are provided for the algorithm. Computational experiments 
illustrate the applicability of such methods to high-dimensional data. 
We also discuss how to extend the proposed methodology to a broader 
class of problems. 

We establish a systematic algebraic relation between the Wald, 
Likelihood Ratio and Lagrangian Multiplier Test (W > LR > LM) 
in the context of the Behrens-Fisher Problem. Moreover, we use our 
algorithm to computationally investigate the finite-sample size and 
power of the Wald, Likelihood Ratio and Lagrange Multiplier Tests, 
which previously were only available through asymptotic results. The 
methods developed here are applicable to much higher dimensional 
settings than the ones available in the literature. This allows us to 
better capture the role of high dimensionality on the actual size and 
power of the tests for finite samples. 

1. Introduction. The so-called Behrens-Fisher Problem may be straight- 
forwardly stated as follows. 

Given two independent random samples Xi, . . . ,-XiVj and Yi,...,Yiv 2 , test 
whether their respective population means pi\ and ^12 coincide in the case 
where their covariances Ei and £2 are unknown. 
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Despite the deceiving simplicity of its form, this problem has motivated 
a wealth of literature that began with the original works of Behrens [1] and 
Fisher [9, 10], and includes Welch [34, 35], Scheffe [26, 27], Yao [37], Robbins, 
Simons and Starr [23], Subrahmanian and Subrahmanian [33] and Cox [8], 
to name a few. For a review of the solutions for the BFP, see, for instance, 
Stuart and Ord [32] and Kim and Cohen [15]. The proposed solutions involve 
a myriad of different approaches, ranging from fiducial inference to Bayesian 
techniques. 

In this paper we are interested in the classical multivariate version of 
the Behrens-Fisher Problem under Normality. In other words, Xj, Yj above 
should be interpreted as d-dimensional Gaussian random vectors with (vec- 
tor) means fi\ and fj,2, and Si, £2 as their respective d x d covariance ma- 
trices, where the sample sizes are greater than d. The sample covariance 
matrices are then positive definite (and thus invertible) with probability 
one if the true covariance matrices Si and £2 are positive definite. Several 
applied problems can be formulated as Behrens-Fisher Problems (in partic- 
ular, for high dimension) in diverse areas such as Speech Recognition (e.g., 
Chien [6]), Quality Control (e.g., Murphy [21]), Development Economics 
(e.g., Schramm, Renn and Biles [28]) and others. 

In this context, the Likelihood Ratio Test is a natural choice in face of the 
well-known asymptotic behavior of the test statistic. It turns out, though, 
that the maximization of the log-likelihood function without restrictive as- 
sumptions on the covariances (e.g., Si = £2) is a nontrivial matter. In gen- 
eral, explicit solutions to the maximization procedure do not exist, and due 
to nonconcavities in the objective function, the solution to the system of 
first order likelihood equations can lead to local optima, as shown in Buot 
and Richards [5]. Numerical algorithms are available in the literature (see, 
e.g., Mardia, Kent and Bibby [20] and Buot and Richards [5]), but their 
convergence properties are unknown. 

The purpose of this paper is two-fold. First, to propose a provably con- 
vergent algorithm, called Cutting Lines Algorithm (CLA), for the Gaussian 
Maximum Likelihood Behrens-Fisher Problem (BFP, for short). Second, 
to use the algorithm to investigate the finite sample properties — size and 
power — of the Likelihood Ratio Test and of the asymptotically equivalent 
Wald and Lagrange Multiplier Tests in the context of the BFP. Such prop- 
erties are generally unknown, especially in high-dimensional contexts. 

The CLA avoids the trap of local maxima, which haunts most approaches 
in the literature, by aiming directly for the maximization of the log-likelihood 
function itself. For this purpose, we make use of the expression for the log- 
likelihood function recently proposed by Buot and Richards [5], which is 
particularly suitable for numerical methods. 

The general maximization strategy may be schematically characterized as 
follows: 
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(i) Lift the log-likelihood maximization problem into a higher-dimensional 
setting by adding artificial variables and constraints. This new problem, the 
Lifted BFP, has the same solution as the original BFP; 

(ii) Create a family of convex modifications (subproblems) of the Lifted 
BFP which we call Ellipsoidal Mean Estimation Problems (EMEP); 

(iii) Solve a sequence of EMEP whose solutions (estimators of the mean) 
converge to the global solution of the Lifted BFP, that is, the proper maxi- 
mum likelihood estimator of the mean. 

Step (i) is a common procedure in Continuous Optimization when one 
wishes to find a simpler (but equivalent) description for the problem in a 
higher-dimensional setting. 

Step (ii) generates a family of convex problems which is computationally 
tractable (in particular, first order conditions are not only necessary but also 
sufficient). In fact, due to the particular structure of the EMEP, we are able 
to propose a specialized method which solves each problem in this family 
very efficiently both theoretically and in (computational) practice. 

Step (iii) plays the crucial role of avoiding local maxima to ensure the 
global optimality. To achieve that, the algorithm relies on the particular 
geometry of the nonconvexities associated with the problem. Such geome- 
try allows for the construction of a sequence of approximations (based on 
supporting lines) to the log-likelihood function itself which can be efficiently 
optimized. We prove that the proposed method converges to a global solu- 
tion. Furthermore, a simulation study provides strong numerical evidence of 
the suitability of the CLA for solving high-dimensional problems. Problems 
with dimension up to 1000 were solved in a couple of minutes. 

We are particularly interested in the finite-sample properties of the Wald, 
Likelihood Ratio and Lagrange Multiplier Tests. We show that their respec- 
tive test statistics satisfy systematic algebraic inequalities in the context of 
the BFP (such a result is known for classical linear models; see Savin [25], 
Berndt and Savin [2], and Breusch [4]). However, the CLA makes it possible 
to go one step further and provide a Monte Carlo study of the actual size and 
the power of such tests. Our results illustrate that the Wald Test is the most 
sensitive among the three to the impact of dimensionality, followed by the 
Likelihood Ratio Test. Especially when the sample size is (relatively) small 
with respect to the dimension, the Wald and the Likelihood Ratio Tests 
tend to over-reject the null hypothesis when we use the x 2 quantiles given 
by Wilks' Theorem. In contrast, the observed size of the Lagrange Multi- 
plier Test seems to be rather robust with respect to dimensionality, with a 
slight tendency to under-reject the null hypothesis. Perhaps not surprisingly, 
these properties carry over to the power of the tests: for fixed sample sizes, 
the Wald Test displays higher power than the Likelihood Ratio Test, which 
in turn displays higher power than the Lagrange Multiplier Test. However, 
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the similar shapes of the observed power curves of the three tests seem to 
suggest that, with appropriate test size adjustment, the three tests may end 
up showing similar power properties. We also applied the Bartlett correc- 
tion to the Likelihood Ratio Test as proposed by Yanagihara and Yuan [36] . 
The corrected test tends to under-reject the null-hypothesis, especially for 
high-dimensional data. Accordingly, it usually displays lower power than the 
Lagrange Multiplier Test. 

In recent years, interesting applied problems have been found which can be 
formulated, in a generic sense, in the framework of the Behrens-Fisher Prob- 
lem under high dimension and low sample size (see, e.g., Srivastava [31]). 
However, in this case no tests invariant under nonsingular linear transfor- 
mations exist (see Srivastava [30] and references therein). Thus, the classical 
Maximum Likelihood formulation of the Behrens-Fisher Problem does not 
seem appropriate. The case of high dimension and low sample size should 
probably be handled by different techniques (or through a nontrivial trans- 
formation to a new Behrens-Fisher Problem), and is a topic for future re- 
search. 

The paper is organized as follows. Section 2 recasts the log-likelihood 
maximization problem as a nonconvex programming problem, and intro- 
duces the EMEP. Section 3 studies the geometry of the nonconvexities as- 
sociated with the log-likelihood function. Section 4 presents the CLA and 
its convergence analysis. Section 5 studies the finite-sample properties of 
the Wald, Likelihood Ratio, Lagrange Multiplier and the Bartlett-corrected 
Likelihood Ratio Tests. It also contains a computational investigation of the 
properties of the CLA in comparison to some widely used heuristic methods. 
Section 6 conveys an extension of the analysis to general BFP-like problems. 
The Appendix contains the following: the pertinent Convex Analysis defi- 
nitions; an explanation of the relation between the EMEP and the BFP; a 
special-purpose algorithm for solving the EMEP; and an alternative conver- 
gent algorithm, called Discretization Algorithm, for solving the BFP. 

2. Lifting and the EMEP. Recall that our goal is to maximize the log- 
likelihood function of two independent random samples {Xi} i J; 1 and {Yi}^3 1 , 
where Xj ~ -Y(/x, Si) and Yj ~ -/V(/i, £2) are ci-dimensional (random) vectors. 
From now on we assume that the sample covariance matrices S± and S2 are 
invertible. The maximization problem means that we should find fi, Si and 
E2 that maximize 

1 Nl N 
Z(/i,Si,£ 2 ) = "2 E( X ' " ^iHXi - p) - -y logdetEi 

1 2 N 
- -J2(Yi - fiy^HYi - fi) - logdet S 2 , 
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which is a highly nonlinear function of /i, £1 and £2. 

Recently, a more (computationally) tractable reformulation of (1) was 
proposed by Buot and Richards [5]. We restate it here as a lemma. 

Lemma 2.1 (Buot and Richards [5]). Denote the vector sample means 

by 

-1 Ni , N 2 

(2) x = ^rY x i and Y = — Vy () 

and the sample covariance matrices by 

1 Ni -, N 2 

( 3 ) S^—Y^Xi-XXXi-X)' and S 2 = — £(K t - Y)(Y - Y)'. 

Nl i=l ^ i=l 

Assume Si and S 2 are invertible, and let fi, be some possible value, or 
estimator, of fi. The original problem of maximizing the likelihood function 
in jjL, Si and S2 can be reduced to the minimization in fi of 

(4) (1 + (x - nys^\x - v)) Ni/ \i +{y- nys^iY - n)f 2/2 . 

Expression (4) is already much more tractable than the original likelihood 
since it depends only on fi. However, the likelihood maximization problem 
can become substantially more amenable to analysis if it is reformulated as 
a suitable mathematical programming problem. We can do that by lifting 
it to a higher-dimensional setting, that is, by including additional variables 
and constraints, and recasting it in the following way. 

Definition 2.1. The Lifted Gaussian Maximum Likelihood Behrens- 
Fisher Problem is to solve 

Ni N 2 
min f(ui,u 2 ) = — log(ui) + — log(w 2 ), 

/J.,Ul,U2 2 2 

(5) ut^l+iX-^S^iX-fi), 

u 2 >l + (Y - ^S^iY - fi). 

Since the solutions for the Lifted Gaussian Maximum Likelihood Behrens- 
Fisher Problem and the original Gaussian Maximum Likelihood Behrens- 
Fisher Problem must coincide, we will use the acronym BFP to refer to the 
former from now on. 

The advantage to the lifting procedure is to confine the nonconvexity of 
the problem to just two variables, u\ and u 2 . Nevertheless, the objective 
function / in (5) still poses a computational challenge since it is noncon- 
vex. This means that we can still expect the existence of local solutions as 
suggested in [5], and further analysis is called for. 
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One may note, though, that / is increasing in u\ and u<i- Moreover, if one 
of the variables, say, u\, is fixed, then the problem becomes fairly simple: 
for each value of u\, we can obtain a solution u^iui). The same can be done 
with u\ as a function of u%- Therefore, associated with (5), we could think 
of a family of tractable "subproblems" (parameterized by u%, e.g.). Next we 
will show how to relate the solutions to this family of subproblems to the 
solution of the original problem. 

Let us focus on the constraints in (5). For a given ju (a "solution"), consider 
the squared Mahalanobis distance functions 

(6) Mx@) = {X- nyS^iX - p) and M Y (J1) = (Y - /2)'S^(Y - £). 

Note the resemblance between such functions and the generalized distance 
function G as defined in Kim [16]. They all give ellipsoids in /I, but our use 
of the functions is different. 

Definition 2.2. The Ellipsoidal Mean Estimation Problem with re- 
spect to X at level v\ is to solve 

(7) h x {vi) :=min{M Y (fi):M x (n) <M 
(analogously for Y). 

In words, the EMEP with respect to X at level V\ is to find the estimate 
/^emep of that minimizes the squared distance M. Y under the constraint 
that the squared distance M. x is bounded by v\. The use of the word "es- 
timate" can be justified in at least two ways. First, Gaussian maximum 
likelihood estimation is based upon finding a vector estimate /Iemep that 
minimizes a similar quadratic form. Second, the procedure above enjoys the 
reasonable property that if X and Y are close (in particular, equal), the 
solution //emep will also be close to Y (in particular, equal). 

Even though the EMEP is simpler than the BFP, there is no closed-form 
solution for the former (for given v\). Nonetheless, EMEP is, in fact, a con- 
vex problem and can be solved efficiently by a variety of available methods 
like gradient descent, interior-point methods, cutting-planes, and so on. Al- 
though all these methods are convergent and a few have good complexity 
properties (see [3, 13, 22]), in the Appendix we propose a specific algorithm 
which explores the particular structure of the problem. Not surprisingly, it 
enjoys better complexity guarantees and better practical performance than 
the aforementioned methods. 

The BFP and the EMEP are, in fact, closely related. The BFP consists 
of achieving the optimal balance between the EMEP for X and Y simulta- 
neously. This happens because the BFP is based upon the minimization of 
a function that is monotone in both distance functions. A precise charac- 
terization of the relation between the BFP and the EMEP is given in the 
following theorem. 
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Theorem 2.1. Let (/x, 7x1,1x2) be a solution to the BFP. Then, Ji is a 
solution to the EMEP with respect to X {with respect to Y) at v\ = M-x^fi) 
[at v 2 = My{V)]- 

Proof. Without loss of generality, we will develop the argument only 
for the EMEP with respect to X. 

Let //EMEP be a solution to the EMEP with respect to X at some positive 
v\. By the monotonicity of log, this means that the expression 

2Vi No 

(8) -f\og{l+ Vl ) + ^\og{l+M Y {ii)) 

is minimized at /xemep- 

Now, let (p,,ui,U2) be a solution to the BFP problem. This means that 
the expression 

(9) ^ log(l + M-xifJ')) + ~y log(l + MyQj)) 

is minimized at Jl and we have ui = 1 + M x (fi). Since expression (9) is 
an upper bound for expression (8) when we set v\ := M. X {V), ft is also a 
solution to the EMEP with respect to X at v\. □ 

Remark 2.1. Since S\ and S 2 are positive definite matrices (not only 
semi-definite), for each level of v\ the EMEP has a unique solution. However, 
this does not guarantee that the BFP also has a unique solution, since it 
could achieve the optimum at two different levels of the distance function. 

3. The underlying geometry of the lifted Behrens Fisher Problem. In 

this section we study the nature of the nonconvexities in (5), and we show 
how the feasible set is related to the EMEP. In particular, we obtain a 
convenient representation of the border of the feasible set that will be used 
in the algorithm developed in Section 4. 

We start by considering the projection of the set of feasible points in (5) 
into the two-dimensional space of u = (141,1*2): 

(10) /C=((n 1 ,n 2 )GM 2 :3/x such that Ul Z ! + 1 • 

I «2>1+%(W J 

Figure 1 illustrates the geometry of K,. Since M is a convex function, /C is 
a convex set. Also, K, is unbounded, since (1x1,1*2) € K, implies that (ui + 
71,1x2 + 72) € /C as well for arbitrarily values of 71,72 > 0. Clearly, u E K, 
implies that u\ > 1 and 1*2 > 1. 

Since the objective function of (5), f(u) = / (1x1,1*2) = ^log(i*i) + 
log (1x2)) depends only on the variables it, the optimal value of (5) equals 

(11) min{/(ix):uG/C}, 
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which still is a nonconvex minimization and potentially has many local min- 



ima. 



However, the representation (11) has two desirable features. First, it com- 
pletely separates the (nonconvex) minimization problem in two variables 
from the high dimensionality of fi. This will be key to avoid the curse of 
dimensionality. Second, we can write out a compact region that contains the 
solution for (11). Define the following problem dependent constants: 



Li = min{l + M x {fj)} = 1, 

U 2 = mm{u 2 : (L u u 2 ) € K} = 1 + M Y (X), 



(12) 



L 2 = min{l + M Y {n)} = 1, 
U x = min{m :(u 1 ,L 2 )eJC} = l+M x {Y). 

Ml 

These quantities define a right triangle 
(13) {(Z 1 ,L 2 ),(Z 1 ,f7 2 ),(C7 1 ,L 2 )}, 

which contains the optimal solution u* = {u\,u 2 ) for (11). In fact, observe 
that, by monotonicity, all points in K, above or to the right of the hypotenuse 
of the triangle have a larger objective value than a point on the hypotenuse. 
Moreover, the remaining points of K, are contained in the triangle. Therefore, 
the coordinates of the triangle vertices in (13) are lower and upper bounds 
on the optimal solution (ul,u 2 ), that is, 

Li<u\<Ui, L 2 <u* 2 <U 2 . 

Th> convex set /C 



1.6 



1 i.i \S 1.3 1.4 1.5 1-6 1.7 1.8 1-9 

"I 



Fig. 1. The convex set K, consists of every point on and above the curve. 
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In particular, if X = Y, the triangle degenerates into a single point (as 
pointed out in [5], the solution is trivial in this case). 

Nevertheless, there is a representation cost associated with (11), in the 
sense that there is no closed-form representation for /C involving only the 
variables u. 

For this reason, we will make use of an additional function g that gives 
information about (part of) the border of K. (which is where the global opti- 
mum is expected to be found, given the quasi-concavity of /). The function 
g is defined as 

(14) g(m) := min{u 2 : (ui,u 2 ) 6 K,}. 

By construction, a point (1(1,112) is in JC if and only if 112 > g{u\). It is easy 
to show that the function g is convex (its epigraph is exactly the convex 
set K) and decreasing in u\. 

Note that the function g is directly related to the EMEP with respect to 
X and the function hx, since 

g(ui) = 1 + min Myifj) = 1 + hx(ui - 1), 

(15) 

ui - l>M x {fi). 

In other words, evaluating g at u\ involves solving an EMEP with respect 
to X. 

4. An algorithm for the Behrens Fisher Problem. In this section we 
propose an algorithm, called the Cutting Lines Algorithm (CLA), that gen- 
erates an e-solution for the BFP. This means that the algorithm reports a 
feasible solution at which the objective function value lie within at most 
e from the value of the objective function at the optimal solution. Since 
the feasible solution is given for arbitrary e > 0, convergence to an optimal 
solution holds. 

The CLA builds upon a polyhedral approximation to the set /C. The 
method optimizes the objective function / over /Q, at each iteration. The 
minimizer point (u\,U2) € /Cfc is used to improve the polyhedral approxima- 
tion for the next iteration. 

As mentioned in the introduction, it is possible to propose an algorithm 
based upon the discretization of the range of values of u\ where we need to 
evaluate g{u\). Such an algorithm, which we call a Discretization Algorithm 
(DA), can be proved to have better worst-case complexity guarantees than 
the ones obtained for the CLA. However, Section 5 shows that the practical 
performance of the CLA strongly dominates that of the DA, since the latter 
requires evaluating the function g — that is, solving an EMEP [see expression 

(15) ] — at every point of the discretization. Thus, we focus on the CLA and 
defer the details of the DA to Appendix C. 
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4.1. The cutting lines algorithm. A good way to develop an algorithm 
for the BFP is to think of constructing sets that (i) approximate fC and (ii) 
have a simple description involving u. Given the convexity of /C, polyhe- 
dral approximations to the set /C are a natural candidate. Moreover, such 
approximations are rather convenient because it is simple to minimize the 
objective function / over polyhedral sets in two dimensions (see Lemma 4.1 
below) . 

4.1.1. Building polyhedral approximations to fC. Our sequence of poly- 
hedral approximations will be based upon the function g. Given the results 
for the EMEP, relation (15) implies that, for any fixed value of u\, not only 
can g(u\) be efficiently evaluated, but also a subgradient s G dg{u\) (see 
Lemma B.l for details) can be easily obtained. Suppose we choose a set of 
points {iti}f=i and gather the triples 

04,s(?4),s*}, s*ed 5 (4), % = !,..., h. 

By the definition of subgradient, we have that 

g{u\) > g(u\) + s l (u\ — u\) for all i = 1, . . . ,k and u\ G R. 

Therefore, we can build a minorant polyhedral approximation for g as 
follows: 

(16) §fc(«l) = max {g{u\) + s\ui - u[)}. 

l<i<k 

In turn, such a function can be used to build a polyhedral approximation 
for K- defined as 

K-k = {(ui,u 2 ) G R 2 :u 2 > dk{ui)}- 

Figure 2 illustrates these relations. 1 

The advantage of working with the polyhedral approximation JC^ instead 
of /C is two- fold. First, K,t has a much nicer representation (via linear in- 
equalities or extreme points) than K, itself. This is particularly interest- 
ing for developing algorithms, which is our goal here. Second, as we an- 
ticipated, the minimization of the desired objective function f(m,U2) = 
^Tog(iti) + 4plog(ii2) on /Qc is rather tractable, as we show in the fol- 
lowing lemma. 

Lemma 4.1. Let ICk C R++ be a {convex) polyhedral set. Then the func- 
tion 

N\ N 2 
f(u 1 ,u 2 ) = — log(ui) + — log(w 2 ) 

is minimized at an extreme point of ICk . 



1 Such approximation for convex sets can be traced back to the Cutting Planes Algo- 
rithm in the Optimization literature [3, 13, 14]. 
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Hk* s'ts fC. and K 
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1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1,8 1.9 



Fig. 2. The convex set K, and its outer polyhedral approximation ICk- The extreme points 
of ICk are the kinks of the graph of the piecewise linear function gk . 



Proof. First, note that since ICk C R+, and because the nonnegative 
orthant is a pointed cone, ICk must have at least one extreme point. Second, 
the optimal solution cannot be an interior point of ICk (otherwise, we can 
strictly decrease both components simultaneously). Third, we recall that / 
is a differentiable quasi-concave function. Therefore, its gradient is a sup- 
porting hyperplane for its upper level sets, which are convex. 

Next, suppose that the minimum is achieved at a nonextreme point of 
ICk, say, x* = az + (1 — a)y, for a 6 (0, 1) and extreme points z, y. By the 
first order conditions, the gradient of / induces a supporting line for fC at 
x* on which both z and y lie. By the (strict) convexity of the upper level 
sets of /, min{/(z), f(y)} < f(x*), a contradiction. □ 

Since ICk is an outer approximation to /C, minimizing / over ICk yields a 
lower bound on the optimal value of (5) for every k. Figure 3 illustrates the 
minorant approximation of f(ui,g(u\)) induced by f(u±, gk{u\)). 

4.1.2. The algorithm. The CLA draws upon the minimization of the ob- 
jective function over the polyhedral approximation ICk to /C, which, as shown 
in Lemma 4.1, needs to be carried out only over the extreme points of ICk- 
A brief description of the algorithm follows. At iteration k, one has a set f l , 
i = 1, . . . , k, of values of the objective function at points (u\ , u % 2 ) ,i = l,...,k, 
respectively. The values /' are then compared to := fiu^u^), where 
(S^K!) is the solution to the minimization of / over ICk- If the distance 
mino<j<fc(/ 1 — f k ) is small enough (note that f l > f k ), the algorithm stops. 
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Otherwise, it takes a new point u k+l , slightly to the right of u k , and gener- 
ates its corresponding u k+1 := g(u k+1 ) by solving an EMEP. The evaluation 
of the objective function / at the pair (u k+l ,u k+1 ) gives a new f k+1 , and 
the algorithm starts over. 

Cutting Lines Algorithm (CLA) 

Input: Tolerance e > 0, u\ = min{[7i, (1 + e/A / " 1 )L 1 }, % = 1, k = 1. 

Step 1. Evaluate n| = and s k € dg(u k ). 

Compute f k = ^ log(ttf) + 4^ log(«§). 

Step 2. Define gk{u\) = maxo<i<fc{u 2 + s l (iii — u^)}. 

Step 3. Compute fk = min{/(«i, 112) ■ U2 > gk( u i), u i > L{\ and the 
corresponding point u k = (u k ,u k ). 

Step 4. If min <j</t(/* — f k ) < e, report min <j<fc f % and correspondent 
pair (u % *,u 2 *). 

Step 5. Else set u k+1 <- minfiTi., Of (1 + e/iVi)}, fc <- + 1, and 
goto Step 1. 

Note that each time a new iteration (say, k + 1) starts, an updated polyhe- 
dral approximation fCk+i is constructed through the introduction of a new 
cut, based on the subgradient dg(u k+1 ). A new cut removes one extreme 
point and creates at most two new extreme points. Therefore, the compu- 
tational effort of minimizing / over fCk grows only linearly with k (in fact, 
by keeping track of previous evaluations, re-optimization can be done even 
faster). 



■he objective [unttiorl ard its tower approximation The obfflCHva function end <:± lower approximation 




S 10 15 20 25 30 35 40 07 0.8 0.9 1 1.1 1.2 U V 1.5 1.6 



Fig. 3. The outer polyhedral approximation for K, leads to a minorant approximation 
for f . Therefore, lower bounds on the optimal value of (5) are derived if we minimize the 
minorant approximation f. The right figure is a zoom in on the dashed square area of the 
left figure. 
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The next theorem shows that the CLA needs only a finite number of 
iterations to compute a e-solution. 

Theorem 4.1. The CLA reports an e-solution to the original problem 
in at most [iM^iM] i oops , 

PROOF. For k > 1, note that u\ +1 < ^(l + e/iVi), and suppose first that 
y-2 +1 — ^2(1 + £ /^2)- In this case, we have 

= f logK +1 ) + f log(n^) 

< e +^]og(2*) + ^iog(a5) 

and we have a e-solution, since U2 +1 ) ^ s feasible. 

Alternatively, if u\ +l > «|(1 + e/N 2 ), we have u^ +l > 1, which implies 
that < C7i. Therefore, it^ 1 = (1 + e/Ni) and the next Cutting Lines 
approximation removes at least a rectangle of area N ^ N u^u^ between the 

difference of and /C. Since the area difference between these sets was 
bounded by U1U2/2 at the very first iteration, the algorithm performs at 
most 

-{tJ x U 2 ){N x N 2 )- 
2e 2 

loops. □ 

This computational complexity result immediately yields the following 
convergence results. 

Corollary 4.1. For jO, let (u^u^) be the Ek-solutions to (11) and 
let the vectors u\, u\) be their induced solutions to the {Lifted) BFP. 
Then, every accumulation point of the sequence {(fi k , u\, uDlfceN * s a solu- 
tion to the BFP. 

Corollary 4.2. The CLA can be used to generate a sequence of points 
that converge to a global solution to the (Lifted) Behrens-Fisher Problem. 

4.2. Computational experiments with CLA and DA. Our complexity 
bound for the CLA is worse than that for the DA. However, the DA solves 
the EMEP for every point of the discretized domain of u±. In contrast, the 
CLA seeks to produce a certificate of e-optimality at each iteration by com- 
paring the best current solution and the solution to the minimization on fC^. 
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Table 1 

Computational times (in seconds) and total number of iterations (which equal the number 
of EMEP problems solved) of the computational experiments with relative tolerance 

£ = 1CT 3 



Medium size instances Average running times (in seconds) Average iterations 



d 


Ni 


JV 2 


Initialization 


DA 


CLA 


DA 


CLA 


20 


100 


200 


0.01 


4.45 


0.006 


6853.2 


15.4 


30 


150 


300 


0.02 


12.41 


0.007 


10859.6 


17.5 


40 


200 


400 


0.03 


12.92 


0.009 


9256.8 


17.4 


50 


250 


500 


0.05 


13.46 


0.010 


8414.6 


18.5 


60 


300 


600 


0.08 


23.71 


0.012 


10495 


17.7 


70 


350 


700 


0.13 


24.15 


0.014 


8502.1 


17.6 


80 


400 


800 


0.18 


42.40 


0.020 


9912.7 


18.7 


90 


450 


900 


0.24 


64.46 


0.025 


11796.4 


18.3 


100 


500 


1000 


0.32 


67.46 


0.036 


9859.5 


19.0 


Large size instances 


Average running times (in 


seconds) 


Average iterations 


d 


Ni 


iV 2 


Initialization 


DA 


CLA 


DA 


CLA 


200 


1000 


2000 


2.07 




0.23 




20.8 


300 


1500 


3000 


6.64 




0.66 




19.8 


400 


2000 


4000 


16.08 




1.66 




20.1 


500 


2500 


5000 


43.35 




3.13 




21.2 


600 


3000 


6000 


56.62 




5.71 




21.5 


700 


3500 


7000 


87.88 




6.88 




22.0 


800 


4000 


8000 


142.05 




12.71 




20.9 


900 


4500 


9000 


455.23 




23.59 




22.1 


1000 


5000 


10000 


671.80 




28.25 




22.3 



In computational practice, this drastically reduces the number of necessary 
iterations to find an e-solution, as can be seen in Table 1 (this table was 
generated in the same way as the Monte Carlo study of the tests sizes, as 
described in Section 5.2 below). Each entry of running times and iterations 
in Table 1 is an average over ten instances. 

Table 1 reflects the expected computational behavior of the methods. 
As the dimension increases, more effort is needed but the CLA is order 
of magnitudes faster than the DA, since the latter requires the complete 
discretization of the interval \Li,Ui\. Such requirement of evaluating the 
function g on 0(1/ e) different points (remember that the complexity analysis 
is exact in the case of the DA) seems to be a naive approach, indeed. 

The polyhedral approximation used in the CLA provides a way of focusing 
the search on a promising region, a concept well exploited in the Optimiza- 
tion literature. Table 1 also illustrates the number of loops required by each 
algorithm in the test problems. 
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The number of loops performed by the Discretization Algorithm depends 
only on the precision e, and on the problem dependent values of L\ and 
tl\. On the other hand, these problem dependent quantities do not seem 
to affect the CLA. This points to the question of whether there exists a 
(better) complexity analysis for the CLA which might be independent of 
these quantities. 

The implementation of the algorithms is a simple task in any programming 
package where matrix inversion and spectral decomposition subroutines for 
positive definite matrices are available (e.g., R, Matlab, etc.). The remain- 
ing algorithmic operations (binary search, computation of extreme points, 
stopping criterion, etc.) follow a relatively simple logic and do not involve 
potential numerical instabilities. We do not claim to have the most efficient 
implementation of the methods proposed here. Nevertheless, our numerical 
results show that the CLA is computationally efficient and scales quite nicely 
as the data dimension d increases. The underlying reason is the certificate 
of optimality that the method is constructing on each iteration. The value 
/min provides a lower bound for the optimal solution which is used to con- 
struct a stopping criterion. For a problem whose dimension is greater than 
one thousand, numerical approximations on the computation of the spectral 
decomposition are a potential limitation of the method to solve the EMEP 
proposed in Appendix B. An alternative approach is to compute an inverse 
matrix at each iteration of the EMEP, which will lead to a more robust 
implementation at the cost of additional running time (see [7] for details). 

In our experiments we use medium and large size instances where the 
data dimension d varies from 20 to 1000. The results were generated using 
a relative precision of e = 10 -3 . We report the average over ten different 
instances. The DA has proved to be too cumbersome for large instances. 



5. Finite sample properties of the Wald, Likelihood Ratio and Lagrange 
Multiplier tests through the CLA. Three commonly used multivariate tests 
based upon the maximization of the log-likelihood function are the Wald 
(W), Likelihood Ratio (LR), and the Lagrange Multiplier (LM) Tests. De- 
fine 9 = (fj,x, fj,2, Si, E2). For a certain hypothesized restriction on the pa- 
rameter space of means 

let 9 denote the unrestricted MLE of 9, and let 9 r denote the MLE under 
the restriction Hq, that is, the solution to the problem 

max 1(9) 
e 

subject to c(/xi,/X2) = q- 
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The test statistics of interest are denned as 



W = [c(p, 1 ,p 2 ) -?]'(Var(c(/xi,/i2) - ?)) 1 [c(fii , p, 2 ) - q] 
LR = -2(l(9 r ) - l{6)) 



and 



LM = e T G r [G^G r ]- 1 G^e, 



where 



(17) 





Vg; log f(xi, r ) and g\ T = log f(y h 9 r ) 



(/is the multivariate density function in question) and e is a vector of ones. 
In the context of the BFP, the restriction can be written as H\ — fx 2 = and 
the W test statistic has the explicit form 



The W Test — which is a pure significance test — bears the computational ad- 
vantage of not requiring the solution to the problem of finding the restricted 
MLE estimator (however, see Section 5.2 below). 

The W, LR and LM Tests are asymptotically equivalent under the null 
hypothesis. However, their behavior can be rather different in small samples, 
and their finite sample properties are usually unknown, except for a few 
particular cases (see, e.g., Greene [12] and Godfrey [11]). In this section we 
use the CLA to investigate and compare the finite sample properties — size 
and power — of these tests. In particular, we are interested in the sensitivity 
of the tests to dimensionality. 

We emphasize that the CLA allows for the study of the properties of 
the tests in high-dimensional contexts. In contrast, the literature on the 
BFP typically overlooks the issue and reports results for small dimensional 
problems, typically smaller than d = 6 and in general no greater than d = 10. 

5.1. Conflict among criteria. It is well known that the W, LR and LM 
statistics for testing linear restrictions in the context of classical linear mod- 
els satisfy the inequalities W > LR > LM (see Savin [25] , Berndt and Savin 
[2], Breusch [4] and Godfrey [11]). Before turning to simulations, we show 
that such inequalities also hold in the case of the BFP. 

Theorem 5.1. For the BFP, 



W = (X - YYiSt/m + S 2 /N 2 )- 1 (X - Y). 



(18) 



W > LR > LM. 
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Proof. To show the first inequality, note that, using since log(l + 5) < 6, 
we have 

LR<c = miniVi(X - //)5f \X -lj,)+N 2 (Y- f^S^fX - fi). 

The optimal solution of the right-hand side is achieved at /2o = (NiS^ 1 + 
N 2 S2 1 )~ 1 (N 1 S^ 1 X + N 2 S 2 ; 1 Y). Using ju , and the matrix identities 

(A + B)- 1 = A- 1 - A' 1 (A- 1 + B-^A" 1 = A~ 1 (A~ 1 + B" 1 )- 1 ^ 1 , 

we prove that c = (X - Y)'(Si/Nx + S 2 /N 2 )~ 1 (X — Y) = W. 

Let ft, be a solution for the BFP. After simplifications, the LAf statistic 
can be written as 

LM = N 1 (X- ft)% l {X + N 2 (Y - fi)%\Y - ft). 

Next note that 

(x - ft)'t T \x -fj) = (x- ft)' SI \x - ft) [{X - ®' s ^ x - * )]2 



i + {x-ft)>s^\x-ft) 

by using a rank-one update formula 2 for Sj -1 . The result follows by consid- 

s 2 

1+5- 



c2 

ering the term for Y as well and noting that log(l + 5) > 5 — ttx- □ 



5.2. Monte Carlo study of the size of the test. Inequalities (18) imply 
that the rejection rate of the W Test is greater than or equal to that of the 
LR Test, which in turn is greater than or equal to that of the LM Test. A 
more accurate understanding of the extent to which this influences the size 
and the power of such tests can be obtained through simulations. 

We performed a Monte Carlo study of the finite-sample properties of the 
W, LR and LM tests at sizes a = 0.01,0.05,0.10. The rejection regions were 
defined based upon Wilks' Theorem on the asymptotic \\ distribution of 
the test statistic. 

The study also includes the Likelihood Ratio statistic with the Bartlett 
correction 



where 

■01 - tjj 2 



Cl 



d 1 

? N$(N-2) i 2 N?(N-2) 2 

^ = iv^vw) {tT{SlS )} + iv^vw) {tr(5a5 )} ' 



2 For invertible M and a vector v, the inverse of the rank-one update of M by vv' can 
be written as (M + vv')' 1 = M" 1 - ^Zhpi^l , 

v ' l-\-v'M~ L v 
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^ = ^(jvi-i) { ( 1 1 )} iv 2 (iv 2 -i) { ( 2 2 )} ' 

and S = ^S l + ^S 2 . 

The Bartlett correction as denned above provides an 0(N~ 2 ) approxi- 
mation to the mean of the \\ distribution (more details can be found in 
Yanagihara and Yuan [36]). We will refer to the LR Test under the Bartlett 
correction as the B Test. 

To facilitate comparison with other works on the multivariate BFP (e.g., 
Yao [37], Subrahmaniam and Subrahmaniam [33], Kim [16] and Krishnamoor- 
thy and Yu [18]), we performed tests for the low dimensional cases of d = 2, 5 
and 10, but we also included the higher-dimensional cases of d = 25,50,75, 
100 and 200. For each d, the sample sizes used were N\ = 5d, lOd, 20d, and 
N2 = 2N\. For a given dimension size d, each covariance matrix £3, i = 1,2, 
was constructed by creating an initial matrix Mi with N(0, 1) entries, and 
then setting Sj = MiM[. 

The results can be seen in Figure 4 (the actual numerical output can 
be found in Table 4 in the Appendix D). Each entry was generated using 
10,000 runs. The W and the LR tests tend to over-reject the null hypothesis, 
while the LM Test tends to slightly under-reject it. We kept constant the 
ratio between the number of observations and the dimension so that we can 
observe how the quality of the approximation behaves as the dimensionality 
of the problem grows. One may notice how sensitive the W and the LR Tests 
are to increases in the dimension. Only for the (relatively) large sample case 
N\ = 20d does the LR Test have actual size fairly close to a. On the other 
hand, the W Test appears to demand even (relatively) larger samples. For 
instance, when d = 100 and a = 0.10, even when N\ = 20d, the W test is off 
by 3.8 percentage points. The ease of computation of the W test statistic 
appears to come at a considerable price in terms of the accuracy of the test. 

In contrast with the W and the LR Tests, the LM shows remarkable 
robustness with respect to dimensionality. For all a, there does not appear 
to be any clear (say, monotonic) pattern of change on the actual test size 
with respect to increases in dimensionality, or maybe even sample size N\. 

For all values of a and different sample sizes, the B Test is roughly as 
accurate as the LM Test for low dimensional settings (roughly, d < 20). For 
d > 20, though, it grossly over-compensates the over-rejection rates of the 
W Test, with the possible exception of the comparatively large sample sizes 
iVi = 20d. 

Figure 4 illustrates the above comments. Accordingly, the W Test usually 
shows the steepest curve of dimension versus actual test size for different N\ , 
while the LM Test displays approximately horizontal curves, especially for 
higher-dimensional settings. 
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Fig. 4. The behavior of the size of the tests when the dimension increases and the ratio 
between the number of observations and dimension is fixed. 
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5.3. Monte Carlo study of the power of the test. We performed compu- 
tational experiments on the power of the W, LR, LM and B Tests for the 
cases of dimension d = 10, 50, 100, and sample sizes N\ = 5d, lOd and 20d, 
with N 2 = 2Ni . 

The analysis of the power for multivariate tests is naturally more difficult 
due to the multi-dimensionality of the parameter space. For this reason, 
we chose to investigate and compare the power of the W , LR, LM and B 
Tests over a standardized parameter space in the following sense. For each 
simulation run, covariance matrices Si and £2 were (randomly) generated 
through the same procedure as the one for the evaluation of the sizes of 
the test. The mean of X, was set to zero by default. The choice of the 
mean(s) of Y, /22(A), was made as solution(s) to the squared Mahalanobis 
distance equation(s) 

( W - /x 2 (A)) / (E 1 + £ 2 )~ V - MA)) = A 2 , 

where A represents a family of appropriately selected constants. For conve- 
nience, such solutions /22(A) were always taken on some canonical axis, and 
the specific axis chosen changed across simulation runs. The use of randomly 
standardized Mahalonobis distances is justified by the fact that the BFP is 
defined without information on the population covariances. 

The results are depicted in Figure 5, which contains plots for dimensions 
d = 10, 50 and 100. Colors represent tests, while geometric figures represent 
sample sizes (e.g., a triangle symbolizes N\ = 5d). 

Perhaps the most striking feature of all four plots (d = 10, 50 and 100) is 
the fact that, for a given sample size N\, the shapes of the power curves for 
the four tests look alike. More specifically, given N\, the curve for the W Test 
looks like an up-shifted version of the curve for the LR Test, which in turn 
looks like an up-shifted version of the curve for the LM Test. The same is true 
for the curve for the B Test, which lies mostly below the curve for the latter. 
The observed "order" of the curves should not come as a surprise. First, 
regarding the W, LR and LM Tests, because of the theoretical inequalities 
in Theorem 5.1. Second, because the simulation results for the test sizes 
show that the W and LR Tests tend to over-reject the null hypothesis (the 
former, substantially more than the latter), while the LM Test has size 
close to a and the B Test tends to under-reject the null hypothesis. In 
other words, we are essentially comparing tests of different sizes (see also 
the conclusions in Breusch [4] for the case of linear regression). The shape 
of the curves suggests the possibility that, if test size adjustment is made 
for the W and LR Tests, the power curves of the three tests may get rather 
close to each other. Such adjustment would imply, of course, going beyond 
Wilks' Theorem and developing exact quantiles, especially for the W and 
the LR Tests. 
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Fig. 5. Monte Carlo study of the power of the W , LR, LM and B Tests for the size 
a — 0.05 with different sample sizes and dimensions equal to 10, 50 and 100. 
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The plot for the low-dimensional case of d = 10 displays a "well-behaved" 
pattern, in the sense that the curves for different tests and for the same 
sample size tend to be grouped together. In particular, the curves for sample 
size N\ = 40d are almost super-imposed, which means that, power-wise, the 
tests are nearly equivalent in this situation. Note that the curves for sample 
size Ni = Wd (triangle) lie above the remaining ones close to the origin, that 
is, in the case where the Mahalanobis distance between ji\ and \i2 is small. 
Again, this should not come as a surprise, since the simulation results for 
the test sizes (i.e., zero Mahalanobis distance between [i\ and show that 
relatively small sample sizes imply a tendency for over-rejection in the case 
of the W and LR Tests. 

The effect of higher dimensionality can be seen in the two remaining plots 
id = 50 and 100). The main impact seems to be greater vertical distances 
among the curves for the four tests, particularly for the cases of smaller 
sample sizes. Even for the higher-dimensional case d = 100, though, the 
larger sample size N\ = 20d brings the curves a lot closer to each other. As 
one might expect, larger sample sizes compensate for high dimension and 
point to the asymptotic equivalence of the W, LR, LM and B Tests. 

5.4. Performance of local methods/heuristics. Up to the present, the nu- 
merical procedures applied to the multivariate Behrens-Fisher Problem have 
been heuristics or locally convergent methods. Since the CLA is a provably 
convergent method that constructs a certificate of global optimality, it pro- 
vides a benchmark for the previous approaches. So, we are now able to 
address via Monte Carlo experiments the statistically important question of 
the performance of the LR Tests based on some widely used heuristics vis- 
a-vis the LR Test based on the CLA. Also, we are interested in the partially 
related issue the computational performance of these heuristics vis-a-vis the 
CLA. 

There is a variety of different heuristics and it is usually hard (if not 
impossible) to make any general statement about them. However, in the case 
of the Behrens-Fisher Problem, we do have a "natural" initial point for these 
algorithms, that is, /2 := (AiSf 1 + N 2 S2 1 )~ 1 (N 1 S^ 1 X + N^S^Y). In fact, 
JIq is at the same time: (i) from the algorithmic perspective, the solution to 
the first-order conditions of the objective function (the log- likelihood) with 
respect (only) to /i, after we substitute Si for Ej, i = 1,2; (ii) from the 
statistical perspective, the estimator of the mean fj, associated with the W 
statistic. Also, by the proof of Theorem 5.1, we have 

(19) W > LR > LR, 

where LRq is the log-likelihood ratio evaluated at fio. Denote by LRh the 
(potentially suboptimal) Likelihood Ratio test statistic based upon a given 
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heuristic method and with juo as its initial point. We can assume (through 
an ad-hoc modification of the heuristic, if necessary) that 

(20) LRq > LR h . 

Thus, by (19), (20) and the fact that LRh > LR, the statistic LRh also 
asymptotically follows a x\ distribution. Nonetheless, the gap between the 
W and the LR statistics can be quite large in finite-samples (see Section 5). 

Note that a LRh Test can only disagree with the LR Test if the LRq Test 
rejects Hq and the latter accepts Hq. We can perform Monte Carlo experi- 
ments (in the same way as in Section 5.2 for the tests sizes) to estimate how 
often the LRq and LR Tests disagree. By (20), this provides a guarantee (i.e., 
an upper-bound) on the "discrepancy rate" between a LRh Test (i.e., based 
on any heuristic) and the LR Test. The results in Table 2 show that this 
worst-case-scenario discrepancy rate is surprisingly small. The discrepancy 
rate for the W Test with respect to the LR Test — substantially higher — was 
also included for the sake of comparison. 

Next we study the computational performance of three commonly used 
methods: Simulated Annealing (SA), Iterative Update (It Up) and Newton's 
Method with Line Search (NM). (See, resp., [17, 19], [5] and [7] for discus- 
sions and implementation details of these methods.) 

Table 3 reports the average performance of the heuristics with respect to 
running times, iterations and discrepancy rate based on their respective LRh 
Tests. As expected, the discrepancy rate is smaller than in Table 2, since 
the heuristics usually provide a solution superior to JIq. Not only that, the 
experiments suggest that both ItUp and NM are robust in terms of discrep- 
ancy rate (with fa as the initial point) even though they can be trapped in 
local minima. Moreover, their good (local) convergence properties are illus- 
trated by the notably small number of iterations. On the other hand, SA 
seems to have trouble achieving local convergence, and its good performance 
with respect to errors appears to be a by-product of the chosen initial point. 
Not surprisingly, the convergence of the CLA turned out to be slower (i.e., 

Table 2 

Monte Carlo study of the discrepancy rates for LRo and W Tests with respect to the LR 
Test for different dimensions d (for each entry, simulations were run until 50 
"successes" were obtained) 



d 


2 


5 


10 


20 


30 


40 


LRo 


0.00176 


0.00113 


0.000932 


0.000978 


0.000825 


0.00135 


W 


0.0299 


0.0325 


0.0261 


0.0468 


0.0342 


0.0513 


d 


50 


60 


70 


80 


90 


100 


LRo 


0.0009 


0.0012 


0.0012 


0.0013 


0.0017 


0.0012 


W 


0.0426 


0.0702 


0.0641 


0.0796 


0.0809 


0.0935 
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Table 3 

The average performance of heuristics averaged over 5000 runs: time (seconds), 
iterations and discrepancy rate 



Algorithms 



d 




SA 






ItUp 






NM 




Time 


Iter 


Discrep 


Time 


Iter 


Discrep 


Time 


Iter 


Discrep 


2 


0.02462 


1000 


0.001 


0.00062 


4.2 





0.00118 


2.3 





5 


0.02547 


1000 


0.001 


0.00081 


4.5 





0.00135 


2.6 





10 


0.02680 


1000 


0.001 


0.00121 


4.7 





0.00166 


2.6 





20 


0.03120 


1000 


0.001 


0.00273 


5.0 





0.00310 


2.7 





30 


0.04078 


1000 


0.001 


0.00799 


5.1 





0.00662 


2.9 





40 


0.04948 


1000 


0.001 


0.00879 


5.4 





0.00787 


3.1 





50 


0.06037 


1000 


0.001 


0.01339 


5.3 





0.01179 


3.0 





60 


0.07422 


1000 


0.001 


0.01998 


5.3 





0.01726 


3.0 





70 


0.09325 


1000 


0.001 


0.03057 


5.3 





0.02537 


3.0 





80 


0.11258 


1000 


0.003 


0.04069 


5.5 





0.03408 


3.1 





90 


0.13279 


1000 


0.001 


0.05461 


5.5 





0.04414 


3.0 





100 


0.16051 


1000 


0.001 


0.07260 


5.9 





0.05852 


3.5 






larger number of iterations) than the local methods ItUp and NM. In fact, 
one should keep in mind that the CLA aims not only to find a good solu- 
tion, but also to construct a certificate of global optimality, which is a much 
harder task. Regarding running time, the main computational cost of CLA 
is the spectral decomposition at initialization (see Table 1). The running 
time of the CLA after initialization is actually faster than SA, ItUp and NM 
at higher dimensions (cf. Tables 1 and 3). 

We now make a few quick remarks regarding the implementation of the 
methods. First, all methods do require a matrix inversion routine: SA and 
CLA, only on the first iteration; ItUp and NW, on every iteration. Second, 
in contrast to CLA, ItUp and NW, the calibration of additional parameters 
is needed for the SA. Third, the implementation of SA and ItUp is very 
simple, while the Line Search for NM is slightly more difficult. Fourth, unlike 
the other methods, the CLA involves the additional implementation costs 
associated with the optimality certificate based on ICk , and with the spectral 
decomposition of a positive definite matrix (see also Section 4.2). 

6. Extension to Behrens Fisher-like Problems. It should be noted that 
the methodology proposed in this paper can be applied to a much broader 
class of problems. Strictly speaking, all we need is to be able to replicate 
the strategy of constructing lifted problems whose solution lie on extreme 
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points of a two dimensional convex domain, and to evaluate the subprob- 
lems which define the convex domain. A sufficient condition for this is the 
quasi-concavity of the objective function of the lifted problem and the con- 
vexity of the subproblems. 

To set up a broader framework, assume we have two random samples 
and {Yi}^! whose log-likelihood functions are denoted by h(X; fi, a) 
and l 2 (Y; M) &)■> respectively. The generalized M-estimation problem of inter- 
est is defined as 

max?i(X; //, a) + l 2 (Y; Mi P)- 

A generalization of the subproblem can be cast in terms of the log- 
likelihood functions directly. Assume there exist two monotone (decreasing) 
transformations Tx , Ty ■ M — ► K such that Tx(h(X;-,-)) and Ty (l 2 (Y; •, •)) 
are convex functions. The subproblems, analogous to the EMEP, are 

hx(ui) = mm{T Y (l 2 (Y; fi, (3)) :T x (h(X; fi,a)) < 14.}. 

The geometric results in Section 3 still hold with minor modifications. More- 
over, under the above convexity assumption, the evaluation of hx{ui) can 
be efficiently performed through standard convex programming techniques. 
Therefore, the convergence results of Section 4 are still valid. 

The above framework encompasses the BFP by taking Tx(z) = exp(-^-z) — 
1 and T Y (z) = exp(-^z) - 1. 

We now give a simple example of the application of the methodology 
described above to a Behrens-Fisher-like Problem. 

Example 6.1. Assume X ~ N{[i,Y>) but, differently from the BFP, Y 
follows a multivariate Laplacian distribution, that is, 

fv{y) = c L exp(-||y -n\\), 

where cl is the normalization constant and || • || is the Euclidean norm. The 
related lifted problem can be cast as 

Ni 

min f(u 1 ,u 2 ) = —\og(ui)+u 2 , 
(21) Ul >l + M x (p), 

N 2 
i=l 



^Higher-dimensional convex domains would impose an additional burden in terms of 
computational complexity. 
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Here, the problem objective function is concave (7/1,112), and therefore the 
solution must lie on the border of the convex domain of these variables. Such 
a domain can be written as 

U!>l + M x (fi) ) 

(22) K= { (ui,u 2 ) eR 2 :3n such that 



u 2 >1 + X! \\Yi-fJL\ 



1=1 



Moreover, the associated subproblems, using Tx{z) = exp(-^-z) — 1 and 
Ty(z) = z, are convex programming problems and have the form 

(N 2 <j 

hx(ui) =min|^ ||Yj - : M x (fi) <uA 

and 

/iy(u 2 ) = min|A4^(/i) ||Y ~ HI < u 2 j- 

Both these problems can be solved via convex quadratic programming, which 
can be done quite efficiently even in high-dimensional cases. 



APPENDIX A: NOTATION OF CONVEX ANALYSIS 

Herein we gather the definitions of relevant concepts in Convex Analysis 
for this work. We refer to [24] for an analytic exposition of Convex Analysis 
and to [13] for a more geometric one. 

A set S is convex if, for any x, y G S, a G [0, 1], ax + (1 — a)y G S. An 
extreme point of a convex set is a point that cannot be written as a strictly 
(a < 1) convex combination of any other distinct points in the set. A set 
P is said to be polyhedral if P = {x G M. n : Ax < b}, where A is a matrix, 
and b, a vector. It follows that polyhedral sets are convex and their extreme 
points are its corners. The recession cone Cs of a convex set 5" is the set of 
directions that go to infinity in S, formally, Cs = {d:d + S C S}. 

A function g:M n — > R is said to be convex if, for any x,y € M n , and 
a G [0,1], g(ax + (1 - a)y) < ag(x) + (1 - a)g(y). A function f :R n 
is quasi-concave if, for any x,y G M n , and a G [0,1], f(ax + (1 — a)y) > 
min{/(x), f(y)}, or equivalently, the upper level sets of / are convex sets. 

Given a convex function g : M. n — > R, we can define its subdifferential at x 
as dg{x) = {s G R n : 5(2/) > g(x) + (s, y — x), for all y G R n }. The elements of 
the subdifferential, also called subgradients, play the role of the gradient in 
case g is nondifferentiable. Note that dg{x) is always nonempty. 
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APPENDIX B: SOLVING THE EMEP 

Consider the convex problem in (7). There are a variety of "general pur- 
pose" convergent algorithms that can solve it. Here, we propose a specific 
algorithm tailored for the particular structure of the EMEP. 

Let A be the (nonnegative) Lagrange multiplier associated with the in- 
equality constraint. The first order conditions are necessary and sufficient, 
and are given by 

2S2 1 (Y-fi) + 2AS'f 1 (X - fi) = 

and 

X(MM-vi)=0. 

Assuming that A > (otherwise, the solution is just ju = Y), the optimal 
ju is a function only of A: 

(23) /1(A) = (S^ 1 + A5f 1 )~ 1 (5j 1 y + A5f 1 X). 

Therefore, in order to solve the EMEP, it suffices to compute a root A* of 
the nonlinear univariate function 

(24) m(X)=Mx(MX))-vi. 

The algorithm we propose here is based upon the algorithm in Ye [38], 
who in turn built upon earlier work by Smale [29]. 

Our algorithm is made up of two main parts. The first part consists of 
a binary search over intervals of increasing length to find which interval It* 
contains what Smale [29] calls an approximate root. 

Definition B.l. A point A is said to be an approximate root of an 
analytic real function m : K — > K if 

| A fc + 1 -A fc |<(l/2) 2fc "- 1 |A 1 -A°|. 

In the second part of the algorithm, Newton's method is used over the 
interval Ij* to find the approximate root A*. For the sake of exposition, we 
focus on the case of m : K — > K (the results in [38] hold in much greater 
generality, though). Recall that the Newton iterate for a function m from a 
current point X k is 

A fc+i = x k_ m ( A *0 

m'(X k + 1 )' 

Newton's Method (NM) converges quadratically from the very first iteration. 
In [29] , Smale gives sufficient conditions under which a particular point is an 
approximate root. Although it is hard to verify Smale's condition in general, 
Ye provided a constructive method to find such a point for a particular class 
of functions. Ye's results in [38] apply in our case. We now write out Ye's 
algorithm and prove a complexity result for it in the context of the BFP. 
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Binary Search and Newton Method 



Input: Upper and lower bounds on the value of the root [a, b], b>a>5, 
tolerance 5 > 0. 

Step 1. Define a partition of [a,b] through intervals of the form 

h = [a(l + l/12)\a(l + l/12) l+1 ). 
Step 2. Perform binary search on these intervals to find ij* 
that contains the true root A*. 

Step 3. Let A = a(l + 1/12)% fc = 0. 

Step 4. Perform Newton's method from A fc : A fc+1 <— A fc — ^rjj^Fj ■ 

Step 5. Stop if k > 1 + log 2 (l + max{0, log 2 (6/<5)}) steps. 

Step 6. Else set k*—k + l, and goto Step 4. 



Theorem B.l. After the computation of a spectral decomposition of the 
matrix ' S^ 1 S\^ 2 , and given a desired precision 5 > and an upper bound 
b for the solution, the algorithm finds a 5-approximate solution A such that 
| A* — Aj < 5 in at most 

O^dloglog-^ 

arithmetic operations. 

Proof. Making the following change of variables/notation 
w ■- S~ 1/2 (> - X), M := S{' 2 S^ 1 S{' 2 = PDP T , 
v = 25^ 1 5 1 1/2 (1 > - X) and s = P T v, 
problem (15) is equivalent to 

h(vi) = mm w T Mw — v T w, 
\\w\\ 2 < V\ 

up to a constant value (which does not matter for the optimization). 
Under the new notation, we can rewrite the function m as 

m (A) = s T (D + \iy 2 s -«i = E ( A S + A )2 " ^" 



The function m(A) is analytic and its derivatives can be easily computed as 



m«(A) = (-# + l)!E rn . 
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Note that ml < and m" > (i.e., m is decreasing and convex). Thus, we 
can evaluate m and m' in 0(d) operations. This implies that each Newton 
step can be implemented in 0(d) arithmetic operations. 

Let A = a(l + 1/12)* be the left endpoint of the interval selected by 
binary search. From Ye [38], it follows that A satisfies Smale's sufficient 
condition to be an approximate root. Therefore, NM converges quadratically 
from the very first iteration (i.e., from A ). From the convexity of m, the 
convergence is monotone, that is, < A < X k < X k+1 < A* < b for every k 
(in particular, we have [A 1 — A°| < 6). This implies that we need at most 

k = l + log 2 (l + max{0,log 2 (b/<S)}) 

Newton steps to achieve \X k — A*| < 5. Moreover, the total number of subin- 
tervals is i og (i+i/i2) l°g(V Q )- The binary search can thus be implemented in 
0(loglog(6/a)). The result follows by noting that we can take a > 6. □ 

Remark B . 1 . Even when we need to solve the EMEP for many different 
levels of the Mahalanobis distance function, the spectral decomposition of 

1/2 —1 1/2 

Si S 2 Si needs to be performed only once. This feature of the algorithm 
makes it a good auxiliary method for the CLA. 

The following lemma illustrates how to obtain subgradients for the func- 
tion hx with no additional computational effort, which is of interest for the 
CLA. 

Lemma B.l. Let A* be a root of the function m as defined in (24). Then 
—A* is a subgradient of hx at v\. 

PROOF. Recall m(X*) = implies that /i(A*) minimizes My {^) + \*M X (/i). 
For any v, we have 

h x {vi) = M Y (H\*)) = M Y (n(X*)) + X(M X (MX*)) - Vl) 

= M Y (fi(X*)) + X*(M x (n(X*)) -v)+X*(v- Vl) 

<h x (v) + X*(v-vi). 

Here, we used weak duality (minmax > maxmin) as follows: 
hx(v) = minmax My(u) + \ (Mv(ll) — v) 

> maxmin My (a) + X(M via) — v) 
~ A>0 M 

>M Y (n(x*))+\*(M x (m)-v). 



30 



A. BELLONI AND G. DIDIER 



Therefore, for every v, we have 

hx(vi) - X*(v - vi) < h x (v), 
which implies that —A* £ dhx{v\). □ 

APPENDIX C: THE DISCRETIZATION ALGORITHM (DA) 

Consider the problem (5) for a fixed value of u\ = u\. In this case, the 
computational problem reduces exactly to solving the EMEP with respect 
to A at a fixed squared distance level u\ — 1. As shown in Section 2, such a 
problem can be solved directly with the algorithm proposed in Appendix B. 

Therefore, given the desired precision, one can discretize the range of the 
variable ui, [Li, Ui], and solve the EMEP for each one of these values. Such 
a scheme yields the following algorithm. 

Discretization Algorithm 

Input: Relative tolerance e > 0, u\ = (1 + 2e/N\)Li, k = 1. 
Step 1. Evaluate u\ = g(u\) and compute f k = ^ log(u k ) + ^ log^). 
Step 2. If (1 + 2e/Nx)u\ > U u compute f k+l = C7iL 2 , goto Step 4. 
Step 3. Else set u k+1 *- (1 + 2e/Ni)u k , k k + 1, goto Step 1. 
Step 4. Report mini<j<fc /* and the correspondent pair (u|*,U2*). 

The following complexity result holds for the Discretization Algorithm. 

Theorem C.l. The Discretization Algorithm reports an e -solution for 
the original problem after exactly [log(C/i/Li)/log(l + 2e/Ai)] loops. 

Proof. Let u* = (uf,^) be a optimal solution. There exists a k such 
that u k < u\ < (1 + 2e/Ni)u k . We consider f k+1 as our candidate. We have 

Ai No 
r = _l lo g K) + _^ lo g (n *) 

(25) < f k+1 = ^ log(l + 2S/N,) + ^ log(^) + ^ log(^ +1 ) 

< e + ^ log(4) + ^ log(^ +1 ) = e + /* , 

where we also used that u k+1 < u\, since g in (14) is decreasing. 

The claim on the number of loops follows by noting that we have u k = 
Li(l + 2e/Ni) k < Ui and by taking logs to bound k. □ 

By choosing a sequence — ► 0, we obtain a sequence of £fc-solutions that 
converge to the optimal solution of the BFP. One drawback to this method 
is that it requires solving the EMEP at every point of the discretization. In 
practice, such requirement may be cumbersome. 
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APPENDIX D: MONTE CARLO STUDY OF SIZE 
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Table 4 

Monte Carlo study 0} size for the W, LR, LM and B Tests (runs per entry — 10,000) 



„ „ . Size of the Test a 
Small size 



instances a = 0.10 a = 0.05 a = 0.01 



d JVi N 2 W LR LM B W LR LM B W LR LM 



2 10 20 

2 20 40 

2 40 80 

5 25 50 

5 50 100 

5 100 200 

10 50 100 

10 100 200 

10 200 400 



0.160 0.133 

0.138 0.122 

0.120 0.114 

0.171 0.133 

0.124 0.110 

0.113 0.106 

0.175 0.131 

0.137 0.118 

0.116 0.107 



0.102 0.092 
0.106 0.103 
0.110 0.107 
0.098 0.090 
0.094 0.090 
0.098 0.096 
0.094 0.084 
0.099 0.093 
0.100 0.098 



0.106 0.077 

0.081 0.067 

0.069 0.061 

0.101 0.073 

0.068 0.055 

0.063 0.057 

0.102 0.072 

0.074 0.062 

0.062 0.056 



0.046 0.046 

0.050 0.051 

0.055 0.054 

0.047 0.046 

0.041 0.041 

0.053 0.052 

0.044 0.041 

0.047 0.047 

0.051 0.051 



0.039 0.020 

0.023 0.013 

0.017 0.014 

0.035 0.019 

0.017 0.011 

0.015 0.013 

0.035 0.018 

0.019 0.012 

0.014 0.011 



0.005 0.009 
0.007 0.009 
0.011 0.011 
0.005 0.008 
0.007 0.008 
0.010 0.010 
0.008 0.008 
0.009 0.009 
0.009 0.009 



, „ ,. . Size of the Test a 
Medium size 



instances a = 0.10 a = 0.05 a = 0.01 



d Ni N2 W LR LM B W LR LM B W LR LM 



25 


125 


250 


0.196 0.141 


25 


250 


500 


0.137 0.109 


25 


500 


1000 


0.120 0.110 


50 


250 


500 


0.232 0.158 


50 


500 


1000 


0.147 0.117 


50 


1000 


2000 


0.126 0.111 


75 


375 


750 


0.262 0.170 


75 


750 


1500 


0.166 0.131 


75 


1500 


3000 


0.133 0.119 


100 


500 


1000 


0.284 0.175 


100 


1000 


2000 


0.175 0.134 


100 2000 


4000 


0.139 0.117 



0.096 0.077 0.118 0.079 0.043 

0.088 0.078 0.071 0.056 0.039 

0.099 0.091 0.064 0.055 0.049 

0.096 0.065 0.144 0.089 0.040 

0.091 0.079 0.083 0.061 0.044 

0.100 0.092 0.070 0.062 0.053 

0.098 0.064 0.167 0.098 0.048 

0.097 0.084 0.098 0.072 0.048 

0.102 0.092 0.073 0.064 0.053 

0.090 0.054 0.179 0.097 0.043 

0.101 0.076 0.104 0.071 0.047 

0.099 0.087 0.073 0.061 0.050 



0.034 0.035 0.018 0.007 0.007 
0.036 0.019 0.014 0.008 0.007 
0.046 0.015 0.011 0.009 0.008 
0.027 0.041 0.018 0.005 0.005 
0.038 0.021 0.013 0.008 0.007 
0.049 0.016 0.012 0.011 0.010 
0.029 0.059 0.025 0.008 0.004 
0.038 0.025 0.016 0.009 0.007 
0.049 0.018 0.014 0.010 0.009 
0.025 0.060 0.025 0.007 0.004 
0.036 0.026 0.016 0.008 0.006 
0.042 0.017 0.013 0.009 0.007 



T . Size of the Test a 
Large size 



instances a = 0.10 a = 0.05 a = 0.01 



d Ni N 2 W LR LM B W LR LM B W LR LM 



200 1000 2000 0.373 0.213 0.095 0.040 0.251 0.123 0.043 0.015 0.101 0.030 0.007 0.002 
200 2000 4000 0.203 0.136 0.085 0.060 0.112 0.073 0.042 0.029 0.032 0.016 0.009 0.005 
200 4000 8000 0.153 0.128 0.099 0.085 0.084 0.064 0.049 0.039 0.019 0.014 0.010 0.007 
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